So far, we’ve been using just the simple mean to make predictions. Today, we’ll continue using the simple mean to make predictions, but now in a complicated way. Before, when we calculated conditional means, we did so in certain “groupings” of variables. When we run linear regression, we no longer need to do so. Instead, linear regression allows us to calculate the conditional mean of the outcome at every value of the predictor. If the predictor takes on just a few values, then that’s the number of conditional means that will be calculated. If the predictor is continuous and takes on a large number of values, we’ll still be able to calculate the conditional mean at every one of those values.
The model we posit for regression is as follows:
\[Y=\beta_0+\beta_1 x_1 +\beta_2 x_2+ ... \beta_k x_k + \epsilon\]
It’s just a linear, additive model. Y increases or decreases as a function of x, with multiple x’s included. \(\epsilon\) is the extent to which an individual value is above or below the line created. \(\beta\) is the amount that \(Y\) increases or decreases for a one unit change in \(x\).
When working with regression models, there are a series of steps we will work through to build and test our models. There is some flexibility in the order they are done, but for now the easiest way to learn will be to approach them in the following order:
load("area_data.Rdata")
ad<-as.data.frame(area_data)
This data comes from the American Community Survey of 2019. It covers all of the metro or micro statistical areas in the United States. It includes characteristics of these areas, include education, income, home ownership and others as described below.
| Name | Description |
|---|---|
| name | Name of Micro/Metro Area |
| college_educ | Percent of population with at least a bachelor’s degree |
| perc_commute_30p | Percent of population with commute to work of 30 minutes or more |
| perc_insured | Percent of population with health insurance |
| perc_homeown | Percent of housing units owned by occupier |
| geoid | Geographic FIPS Code (id) |
| income_75 | Percent of population with income over 75,000 |
| perc_moved_in | Percent of population that moved from another state in last year |
| perc_in_labor force | Percent of population in labor force |
| metro | Metropolitan Area? Yes/No |
| state | State Abbreviation |
| region | Census Region |
| division | Census Division |
Our dependent variable will be the percent of the population that has a yearly income over $75,000. Let’s take a look at this variable.
#Plots the density distribution
ad%>%
ggplot(aes(x=income_75))+
geom_density()
#We can also look at the five number summary (min, Q1, median Q3, Max)
fiveIncome<-as.data.frame(fivenum(ad$income_75))
colnames(fiveIncome)<-"five.num"
fiveIncome
## five.num
## 1 11.97289
## 2 28.26535
## 3 33.12161
## 4 38.69464
## 5 72.28597
#We can also get the numbers with summary
summary(ad$income_75)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.97 28.27 33.12 34.16 38.69 72.29
summary(ad$income_75)[-4]
## Min. 1st Qu. Median 3rd Qu. Max.
## 11.97289 28.26990 33.12161 38.69215 72.28597
#Adding to the plot just to show you the relationship (you won't need to do this in this class)
ad%>%
ggplot(aes(x=income_75))+
geom_density()+
geom_text(data=fiveIncome, aes(x=five.num,y=-.001,label=format(round(five.num,1),nsmall=1)))+ ylab("density")
This variable has a nice symmetric distribution. It looks approximately normal, which will help in interpreting the results.
ad%>%
ggplot(aes(x=college_educ, y=income_75)) +
geom_point() + #produces scatterplot of % college ed by % income over 75
geom_smooth(method="lm") #adds linear model line of best fit
## `geom_smooth()` using formula 'y ~ x'
#THINK about what each dot represents on this plot.
It’s a good idea to plot the data before running a regression. The
scatterplot above shows the percent of the population with incomes above
75,000 on the y axis and the percent of the population with a bachelor’s
degree on the x axis. I always like to be able to identify the
individual units in datasets like this. To do that I use the
ggplotly command below.
gg<-ad%>%
ggplot(aes(x=college_educ, y=income_75, label=name))+
geom_point()+
geom_smooth(method="lm")
ggplotly(gg)
## `geom_smooth()` using formula 'y ~ x'
By using the plotly command, we can identify the
individual units. Try it out, seeing which units have high or low
incomes or higher or lower percentages of the population with a
bachelor’s degree.
Note: if using plotly you cannot knit as PDF. You will
need to knit as HTML.
The essence of prediction is discovering the extent to which our models generalize or predict outcomes for data that does not come from our sample. Many times this process is temporal. We fit a model to data from one time period, then take predictors from a subsequent time period to come up with a prediction in the future. For instance, we might use data on team performance to predict the likely winners and losers for upcoming soccer games.
This process does not have to be temporal. We can also have data that is out of sample because it hadn’t yet been collected when our first data was collected, or we can also have data that is out of sample because we designated it as out of sample.
The data that is used to generate our predictions is known as training data. The idea is that this is the data used to train our model, to let it know what the relationship is between our predictors and our outcome. So far, we have only worked with training data.
That data that is used to validate our predictions is known as testing data. With testing data, we take our trained model and see how good it is at predicting outcomes using out of sample data.
One very simple approach to this would be to cut our data in half. We could then train our model on half the data, then test it on the other half. This would tell us whether our measure of model fit (e.g. rmse) is similar or different when we apply our model to out of sample data. That’s what we’re going to do in this lesson. We’ll split the data randomly in half, with one half used to train our models and the other half used to test the model against out of sample data.
I use the initial_split command to designate the way
that the data should be split– in this case in half (.5). The testing
data (which is a random half of the original dataset) is stored as
ad_test. The training data is stored as
ad_train. We’ll run our models on
ad_train.
Note: the set.seed command ensures that your random
split should be the same as my random split.
set.seed(35202)
split_data<-ad%>%
initial_split(prop=.5)
ad_train<-training(split_data)
ad_test<-testing(split_data)
We will discuss next week as an organizational framework for running multiple models to compare.
To start training our model, we need to specify a formula. The dependent variable always goes on the left hand side, while the independent variable or variables always go on the right hand side. The formula below is for a model with the percent of the population with incomes above 75,000 on the left hand side, and the percent of the population with a college education on the right hand side.
income_formula<-as.formula("income_75 ~ college_educ")
The next step is to define the model. The type of model I want is a linear regression, which I specify below. This is also sometimes called “ordinary least squares” regression.
lm_fit <-
linear_reg() %>%
set_engine("lm")%>%
set_mode("regression")
We will discuss next week how to use recipes within the workflow framework.
Stay tuned for next week.
Now we’re ready to actually fit the model to the data. In the
fit command below I specify which formula should be used
and which dataset to fit the linear regression specified above.
lm_results<-
lm_fit%>%
fit(income_formula, data=ad_train) #we build/tweak our model in the TRAINING dataset
To examine the results, we can use the summary
command.
summary(lm_results$fit)
##
## Call:
## stats::lm(formula = income_75 ~ college_educ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3046 -3.4308 -0.1221 3.3638 24.9082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.1605 0.8057 23.78 <2e-16 ***
## college_educ 0.6160 0.0311 19.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.969 on 461 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4586
## F-statistic: 392.3 on 1 and 461 DF, p-value: < 2.2e-16
What this tells us is that for each additional percent of the population with a bachelor’s degree, the percent of the population with incomes above 75,000 is predicted to increase by 0.62
Since we have fit the model to the training data and obtained parameter estimates, we are going to use those estimates to generate predicted values for our testing data. We are intentionally constraining the predicted values of the testing data as a means of checking generalizabiity. Since the data all came from the same “population” (the original dataset). The results calculated from one half should fit the other.
ad_test<-lm_results%>% ## start with the results
predict(new_data=ad_test)%>% ## create a prediction
rename(pred_educ=.pred)%>% ## rename the prediction- I'm using a more descriptive name than just "pred" here
bind_cols(ad_test) ## add the prediction to the testing dataset
Now we have the prediction in our testing dataset, named
pred_educ using the rename command. We can compare the
actual value with the predicted value in the testing data using
rmse.
rmse_1<-yardstick::rmse(ad_test,
truth=income_75,
estimate=pred_educ)
rmse_1
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.22
The rmse of 6.2 indicates how close our prediction is to the actual value in the testing dataset. This is important because it gives a sense of how the model we trained performs when trying to make predictions using new– or out of sample– data.
Run a regression using a different predictor. Calculate rmse and see if you can beat my score.
We can continue refining our model by adding more variables. Regression with more than one predictor is called multiple regression. The nice thing is that the steps will all be the same, we just need to change the formula and the rest stays pretty much as it was.
Typically, before running an MLR we want to check the correlations between our IVs(predictors) and our DV(outcome) as well as the correlations across our IVs. We WANT our IVs to be highly correlated with our DV (this indicates they might be good predictors, right?) but not highly correlated with each other.
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.2.1
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
cormat<-tab_corr(ad%>%select(income_75, college_educ, perc_homeown), triangle = "lower", string.diag = c("1","1","1"))
cormat
| income_75 | college_educ | perc_homeown | |
|---|---|---|---|
| income_75 | 1 | ||
| college_educ | 0.695*** | 1 | |
| perc_homeown | 0.074* | -0.164*** | 1 |
| Computed correlation used pearson-method with listwise-deletion. | |||
Above we see strong correlations between our IVs and our DV. Now,
let’s look at the correlation between IVs. Here’s where we do NOT want
super high correlations. We see the correlation between
perc_homeown & college_educ is only about
-.2 (although that is statistically significant at the .001 level, the
actual correlation magnitude is fairly low).
We already did this step, so we don’t need to do it again.
Next week.
income_formula<-as.formula("income_75 ~ college_educ + perc_homeown")
lm_fit <-
linear_reg() %>%
set_engine("lm")%>%
set_mode("regression")
Next week.
Next week.
lm_results<-
lm_fit%>%
fit(income_formula, data=ad_train)
summary(lm_results$fit)
##
## Call:
## stats::lm(formula = income_75 ~ college_educ + perc_homeown,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.764 -3.427 -0.247 3.089 25.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.80022 2.98639 -1.273 0.204
## college_educ 0.66144 0.02975 22.234 < 2e-16 ***
## perc_homeown 0.31637 0.03981 7.948 1.48e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.603 on 460 degrees of freedom
## Multiple R-squared: 0.525, Adjusted R-squared: 0.5229
## F-statistic: 254.2 on 2 and 460 DF, p-value: < 2.2e-16
ad_test<-lm_results%>%
predict(new_data=ad_test)%>%
rename(pred_coll_hown=.pred)%>%
bind_cols(ad_test)
rmse_2<-yardstick::rmse(ad_test,
truth=income_75,
estimate=pred_coll_hown)
rmse_2
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.18
The rmse of 6.2 shows that we do get a more accurate prediction in the testing dataset using two predictors.
You MUST remember: correlation is not causation. All you can pick up on using this tool is associations, or common patterns. You can’t know whether one thing causes another. Remember that the left hand side variable could just as easily be on the right hand side.
Run a regression using two different predictors. Calculate rmse and see if you can beat my score.